This document describes the mapping to the genome of the MiSeq sequences of April 2015 (ASO knockdowns and time course). RNA was obtained from the THP-1 PMA stimulation time course and the ASO knockdown of CASPAR9_1 by Jens Stolte, and from the knockdowns of MYB CASPAR and GFI1 CASPAR by Yuri Ishizu. The RNA sample IDs are 4593-4606. The MiSeq paired-end run is 150410_M00528_0116_000000000-AD0NU.

# 1. Retrieve the MiSeq sequencing data

Create the directories for the MiSeq data analysis:
```
mkdir /osc-fs_home/mdehoon/Data/CASPARs/MiSeq
mkdir /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Fastq
mkdir /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mkdir /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
mkdir /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping
```

Set some environment variables:
```
export RUN_ID=150410_M00528_0116_000000000-AD0NU
export SEQ_DIR=/sequencedata/MiSeq/$RUN_ID/Data/Intensities/BaseCalls
export MiSeq_SUMMARY=$SEQ_DIR/Alignment/
```

Copy the sequencing job information to the local directory:
```
for FILE in CompletedJobInfo.xml DemultiplexSummaryF1L1.txt \
            GenerateFASTQRunStatistics.xml SampleSheetUsed.csv
  do install -m 664 $MiSeq_SUMMARY/$FILE . ; done
```

Create links to the sequencing data, and remove the links to the reads for which demultiplexing failed because the index sequence could not be determined:
```
for file in $SEQ_DIR/*fastq.gz ; do ln -sf $file ; done
rm Undetermined_S0_L001_R?_001.fastq.gz
```

# 2. Run TagDust on the MiSeq sequences

Run `tagdust` on the demultiplexed sequences to separate the data by barcode.
```
tagdust --version
```
reports `Tagdust 2.13`.
```
mkdir -p tagdust
python make_tagdust_scripts.py
bash -x script.sh
```
Rename the generated sequences to include their sample name:
```
tail -n +2 multiplex.txt |
  cut -f 3,4,6 |
  while read name bc number
  do
    echo "Sorting and renaming ${number}_BC_${bc}_READ1.fq to ${name}_READ1.fq"
    cat "${number}_BC_${bc}_READ1.fq" | paste - - - - | sort -k1,1 -t " " | tr "\t" "\n" > "${name}_READ1.fq"
    rm "${number}_BC_${bc}_READ1.fq"
    echo "Sorting and renaming ${number}_BC_${bc}_READ2.fq to ${name}_READ2.fq"
    cat "${number}_BC_${bc}_READ2.fq" | paste - - - - | sort -k1,1 -t " " | tr "\t" "\n" > "${name}_READ2.fq"
    rm "${number}_BC_${bc}_READ2.fq"
  done
```
which outputs
```
Sorting and renaming 1_BC_CAC_READ1.fq to t00_r1_READ1.fq
Sorting and renaming 1_BC_CAC_READ2.fq to t00_r1_READ2.fq
Sorting and renaming 1_BC_AGT_READ1.fq to t00_r2_READ1.fq
Sorting and renaming 1_BC_AGT_READ2.fq to t00_r2_READ2.fq
Sorting and renaming 1_BC_GCG_READ1.fq to t00_r3_READ1.fq
Sorting and renaming 1_BC_GCG_READ2.fq to t00_r3_READ2.fq
Sorting and renaming 2_BC_CAC_READ1.fq to t01_r1_READ1.fq
Sorting and renaming 2_BC_CAC_READ2.fq to t01_r1_READ2.fq
Sorting and renaming 2_BC_AGT_READ1.fq to t01_r2_READ1.fq
Sorting and renaming 2_BC_AGT_READ2.fq to t01_r2_READ2.fq
Sorting and renaming 2_BC_GCG_READ1.fq to t01_r3_READ1.fq
Sorting and renaming 2_BC_GCG_READ2.fq to t01_r3_READ2.fq
Sorting and renaming 3_BC_CAC_READ1.fq to t04_r1_READ1.fq
Sorting and renaming 3_BC_CAC_READ2.fq to t04_r1_READ2.fq
Sorting and renaming 3_BC_AGT_READ1.fq to t04_r2_READ1.fq
Sorting and renaming 3_BC_AGT_READ2.fq to t04_r2_READ2.fq
Sorting and renaming 3_BC_GCG_READ1.fq to t04_r3_READ1.fq
Sorting and renaming 3_BC_GCG_READ2.fq to t04_r3_READ2.fq
Sorting and renaming 4_BC_CAC_READ1.fq to t12_r1_READ1.fq
Sorting and renaming 4_BC_CAC_READ2.fq to t12_r1_READ2.fq
Sorting and renaming 4_BC_AGT_READ1.fq to t12_r2_READ1.fq
Sorting and renaming 4_BC_AGT_READ2.fq to t12_r2_READ2.fq
Sorting and renaming 4_BC_GCG_READ1.fq to t12_r3_READ1.fq
Sorting and renaming 4_BC_GCG_READ2.fq to t12_r3_READ2.fq
Sorting and renaming 5_BC_CAC_READ1.fq to t24_r1_READ1.fq
Sorting and renaming 5_BC_CAC_READ2.fq to t24_r1_READ2.fq
Sorting and renaming 5_BC_AGT_READ1.fq to t24_r2_READ1.fq
Sorting and renaming 5_BC_AGT_READ2.fq to t24_r2_READ2.fq
Sorting and renaming 5_BC_GCG_READ1.fq to t24_r3_READ1.fq
Sorting and renaming 5_BC_GCG_READ2.fq to t24_r3_READ2.fq
Sorting and renaming 6_BC_CAC_READ1.fq to t96_r1_READ1.fq
Sorting and renaming 6_BC_CAC_READ2.fq to t96_r1_READ2.fq
Sorting and renaming 6_BC_AGT_READ1.fq to t96_r2_READ1.fq
Sorting and renaming 6_BC_AGT_READ2.fq to t96_r2_READ2.fq
Sorting and renaming 6_BC_GCG_READ1.fq to t96_r3_READ1.fq
Sorting and renaming 6_BC_GCG_READ2.fq to t96_r3_READ2.fq
Sorting and renaming 7_BC_ATG_READ1.fq to c91_r1_READ1.fq
Sorting and renaming 7_BC_ATG_READ2.fq to c91_r1_READ2.fq
Sorting and renaming 7_BC_TAC_READ1.fq to c91_r2_READ1.fq
Sorting and renaming 7_BC_TAC_READ2.fq to c91_r2_READ2.fq
Sorting and renaming 7_BC_GCT_READ1.fq to c91_r3_READ1.fq
Sorting and renaming 7_BC_GCT_READ2.fq to c91_r3_READ2.fq
Sorting and renaming 8_BC_ATG_READ1.fq to n91_r1_READ1.fq
Sorting and renaming 8_BC_ATG_READ2.fq to n91_r1_READ2.fq
Sorting and renaming 8_BC_TAC_READ1.fq to n91_r2_READ1.fq
Sorting and renaming 8_BC_TAC_READ2.fq to n91_r2_READ2.fq
Sorting and renaming 8_BC_GCT_READ1.fq to n91_r3_READ1.fq
Sorting and renaming 8_BC_GCT_READ2.fq to n91_r3_READ2.fq
Sorting and renaming 9_BC_ATG_READ1.fq to lip_r1_READ1.fq
Sorting and renaming 9_BC_ATG_READ2.fq to lip_r1_READ2.fq
Sorting and renaming 9_BC_TAC_READ1.fq to cel_r1_READ1.fq
Sorting and renaming 9_BC_TAC_READ2.fq to cel_r1_READ2.fq
Sorting and renaming 9_BC_GCT_READ1.fq to neg_r1_READ1.fq
Sorting and renaming 9_BC_GCT_READ2.fq to neg_r1_READ2.fq
Sorting and renaming 10_BC_ATG_READ1.fq to myb_r1_READ1.fq
Sorting and renaming 10_BC_ATG_READ2.fq to myb_r1_READ2.fq
Sorting and renaming 10_BC_TAC_READ1.fq to myb_r2_READ1.fq
Sorting and renaming 10_BC_TAC_READ2.fq to myb_r2_READ2.fq
Sorting and renaming 10_BC_GCT_READ1.fq to myb_r3_READ1.fq
Sorting and renaming 10_BC_GCT_READ2.fq to myb_r3_READ2.fq
Sorting and renaming 1_BC_ATG_READ1.fq to gfi_r1_READ1.fq
Sorting and renaming 1_BC_ATG_READ2.fq to gfi_r1_READ2.fq
Sorting and renaming 1_BC_TAC_READ1.fq to gfi_r2_READ1.fq
Sorting and renaming 1_BC_TAC_READ2.fq to gfi_r2_READ2.fq
Sorting and renaming 1_BC_GCT_READ1.fq to gfi_r3_READ1.fq
Sorting and renaming 1_BC_GCT_READ2.fq to gfi_r3_READ2.fq
Sorting and renaming 2_BC_ATG_READ1.fq to nkd_r1_READ1.fq
Sorting and renaming 2_BC_ATG_READ2.fq to nkd_r1_READ2.fq
Sorting and renaming 2_BC_TAC_READ1.fq to nkd_r2_READ1.fq
Sorting and renaming 2_BC_TAC_READ2.fq to nkd_r2_READ2.fq
Sorting and renaming 2_BC_GCT_READ1.fq to nkd_r3_READ1.fq
Sorting and renaming 2_BC_GCT_READ2.fq to nkd_r3_READ2.fq
```
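
The `paste - - - - | sort -k1,1 -t " " | tr "\t" "\n"` idiom above joins the four lines of each FASTQ record, sorts the records by read name (the header text before the first space), and splits them again, so that READ1 and READ2 files list their reads in the same order. A minimal Python sketch of the same idea follows; the function names are hypothetical, and Python compares strings by code point, which matches `sort` only under `LC_ALL=C`.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        yield header, sequence, plus, quality

def sort_fastq(in_handle, out_handle):
    """Write the records sorted by read name (header text before the first space)."""
    records = sorted(read_fastq(in_handle), key=lambda r: r[0].split(" ", 1)[0])
    for record in records:
        out_handle.write("\n".join(record) + "\n")

# Toy input with two records in the "wrong" order.
example = "@read2 1:N:0:1\nACGT\n+\nIIII\n@read1 1:N:0:1\nTTGA\n+\nIIII\n"
out = io.StringIO()
sort_fastq(io.StringIO(example), out)
sorted_text = out.getvalue()
```

Like the shell pipeline, this sketch holds all records of one file in memory while sorting, which is acceptable for MiSeq-sized libraries.
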
Confirm the number of extracted sequences in `*_r[123]_READ[12].fq` and unextracted sequences in `*_un_READ[12].fq`:
```
wc *_r[123]_READ[12].fq > tagdust.extracted
wc *_un_READ[12].fq > tagdust.unextracted
grep extracted tagdust/* > tagdust.log
```
Check the tagdust log files in the `tagdust` subdirectory for error messages.
Extraction rates range from 96.3% to 98.2%. Remove the `tagdust` directory and the script files:
```
rm -r tagdust
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```
Remove the files with unextracted sequences:
```
rm *_un_READ[12].fq
```
Also remove the linked fastq files:
```
rm *_S*_L001_R[12]_001.fastq.gz
```
Compress and save the Fastq files:
```
gzip *.fq
mv *.fq.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Fastq/
```

Count the number of paired reads in each sample:
```
python count_sequences.py
```
generating these counts:

| condition | replicate | number of paired reads |
| --------- | --------- | ---------------------- |
|    t00    |     r1    |   311097               |
|    t00    |     r2    |   311192               |
|    t00    |     r3    |   304059               |
|    t01    |     r1    |   497953               |
|    t01    |     r2    |   417687               |
|    t01    |     r3    |   567353               |
|    t04    |     r1    |   408071               |
|    t04    |     r2    |   181007               |
|    t04    |     r3    |   432629               |
|    t12    |     r1    |   526797               |
|    t12    |     r2    |   259555               |
|    t12    |     r3    |   267750               |
|    t24    |     r1    |   387899               |
|    t24    |     r2    |   268074               |
|    t24    |     r3    |   137045               |
|    t96    |     r1    |   544532               |
|    t96    |     r2    |   158561               |
|    t96    |     r3    |   264786               |
|    c91    |     r1    |   166030               |
|    c91    |     r2    |   291285               |
|    c91    |     r3    |   227677               |
|    n91    |     r1    |   140126               |
|    n91    |     r2    |   191037               |
|    n91    |     r3    |   295903               |
|    lip    |     r1    |   135468               |
|    cel    |     r1    |   258725               |
|    neg    |     r1    |     4114               |
|    myb    |     r1    |   160097               |
|    myb    |     r2    |   356109               |
|    myb    |     r3    |    82761               |
|    gfi    |     r1    |    86296               |
|    gfi    |     r2    |    97114               |
|    gfi    |     r3    |   333210               |
|    nkd    |     r1    |   135496               |
|    nkd    |     r2    |   203011               |
|    nkd    |     r3    |   116032               |

Note that `neg` `r1` is a library prepared without the 3' linker as a negative control for the protocol, which is consistent with its low read count.
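
The script `count_sequences.py` is not shown here. As a rough sketch of what such a count involves, assuming the files are the gzipped `<name>_READ1.fq.gz` and `<name>_READ2.fq.gz` files stored above and that each record spans exactly four lines, one could write the following (function names are hypothetical):

```python
import gzip

def count_fastq_records(path):
    """Count the records in a gzipped FASTQ file (four lines per record)."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4

def count_pairs(read1_path, read2_path):
    """Return the number of read pairs, checking that both files agree."""
    n1 = count_fastq_records(read1_path)
    n2 = count_fastq_records(read2_path)
    if n1 != n2:
        raise ValueError(f"{read1_path} has {n1} reads but {read2_path} has {n2}")
    return n1
```

Checking that READ1 and READ2 contain the same number of records is a cheap sanity test that the pairing survived the TagDust extraction.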

The number of unique read sequences is much lower than the total number of reads. Make a list of unique reads for mapping:
```
python make_unique_sequence_list.py
```
This will create the Fasta files `seqlist_READ1.fa` and `seqlist_READ2.fa` with unique sequences for READ1 and READ2, as well as index files `<library>.index.txt` specifying which of the unique sequences each sequenced read contains.
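
The idea behind the unique sequence list and the index files can be sketched as follows. This is only an illustration of the collapsing step, not the contents of `make_unique_sequence_list.py`: the function name is hypothetical, and the actual file formats written by the script are not shown here.

```python
def collapse_reads(reads):
    """Collapse reads to unique sequences.

    Returns (unique_sequences, indices) such that
    reads[i] == unique_sequences[indices[i]] for every i.
    """
    position = {}          # sequence -> index in unique_sequences
    unique_sequences = []
    indices = []
    for sequence in reads:
        if sequence not in position:
            position[sequence] = len(unique_sequences)
            unique_sequences.append(sequence)
        indices.append(position[sequence])
    return unique_sequences, indices

# Toy example: four reads, two distinct sequences.
reads = ["ACGT", "TTGA", "ACGT", "ACGT"]
unique_sequences, indices = collapse_reads(reads)
```

Each unique sequence then needs to be aligned only once, and the per-read indices allow the alignment results to be expanded back to the individual libraries.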

# 3. Filter against chrM, ribosomal RNA, and tRNAs

## 3.1 Filter against chrM

Create the scripts to filter against mitochondrial DNA, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py chrM 100
bash -x script.sh
```
This will run
```
python filter.py chrM READ1 <start_index> <end_index>
python filter.py chrM READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the forward and reverse strands of the chrM genome sequence. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py chrM
```
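
For reference, the Needleman-Wunsch recursion used for global alignment can be sketched in a few lines of Python. The scoring parameters (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for illustration; the scoring scheme and any special treatment of end gaps in `filter.py` are not shown in this document.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of a and b (linear gap penalty)."""
    # previous[j] holds the best score for aligning a[:i-1] with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                               previous[j] + gap,               # gap in b
                               current[j - 1] + gap))           # gap in a
        previous = current
    return previous[-1]

identical = needleman_wunsch_score("ACGT", "ACGT")
one_deletion = needleman_wunsch_score("ACGT", "AGT")
```

Only two rows of the dynamic programming matrix are kept, so the sketch returns the score but not the alignment itself; recovering the alignment requires storing the full matrix or the traceback pointers.
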
Convert the coordinates of the forward and reverse strand of chrM to genomic coordinates:
```
for file in *.chrM.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```
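
The first lines of the loop above split each file name into library and target, and build a SAM header from the chromosome sizes while dropping the `_alt` sequences. The Python sketch below mirrors these two steps for illustration only; the function names and the toy chromosome names are hypothetical.

```python
def parse_psl_name(filename):
    """Split '<library>.<target>.psl' into (library, target).

    Like `IFS=. read library target psl`, any additional dots end up in the
    last field, which then fails the check against 'psl'.
    """
    parts = filename.split(".", 2)
    if len(parts) != 3 or parts[2] != "psl":
        raise ValueError(f"unexpected file name: {filename}")
    return parts[0], parts[1]

def sam_header(chrom_sizes_lines, version="1.6"):
    """Build SAM header lines from chrom.sizes lines, skipping lines containing '_alt'."""
    lines = [f"@HD\tVN:{version}"]
    for line in chrom_sizes_lines:
        if not line.strip() or "_alt" in line:
            continue
        name, length = line.split()[:2]
        lines.append(f"@SQ\tSN:{name}\tLN:{int(length)}")
    return lines

library, target = parse_psl_name("t00_r1.chrM.psl")
# Toy chromosome names and lengths, for illustration only.
header = sam_header(["chrA\t1000", "chrA_x_alt\t200", "chrB\t500"])
```

The file name check guards against library names that themselves contain a dot, which would otherwise shift the target into the wrong field.
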
Compress and store the results:
```
gzip *.chrM.psl
mv *.chrM.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.chrM.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```
Remove the intermediate files:
```
rm chrM.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 3.2 Filter against ribosomal RNA

Create the scripts to filter against ribosomal RNA, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py rRNA 100
bash -x script.sh
```
This will run
```
python filter.py rRNA READ1 <start_index> <end_index>
python filter.py rRNA READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the ribosomal RNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py rRNA
```
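
The `<start_index>` and `<end_index>` arguments indicate that the unique sequences are divided into index ranges, one per Grid Engine job. A minimal sketch of such a split is shown below. It assumes that the argument `100` is the number of jobs and that the ranges are half-open; neither is stated in this document, and the function name is hypothetical.

```python
def index_ranges(n_sequences, n_jobs):
    """Split range(n_sequences) into at most n_jobs contiguous half-open (start, end) ranges."""
    ranges = []
    for job in range(n_jobs):
        start = job * n_sequences // n_jobs
        end = (job + 1) * n_sequences // n_jobs
        if start < end:  # skip empty ranges when there are more jobs than sequences
            ranges.append((start, end))
    return ranges

chunks = index_ranges(10, 3)
```

Using integer arithmetic on `job * n_sequences` keeps the ranges contiguous and their sizes within one sequence of each other.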

Convert the transcript coordinates to genomic coordinates:
```
for file in *.rRNA.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```
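
Conceptually, converting transcript coordinates to genomic coordinates means locating the alignment block of the transcript that contains a given position and adding the offset within that block to the genomic start of the block. The sketch below illustrates this for a forward-strand transcript with toy coordinates; `pslMap` itself does considerably more (reverse strands, alignments that span several blocks), and the function name is hypothetical.

```python
def transcript_to_genome(position, blocks):
    """Map a 0-based transcript position to a 0-based genomic position.

    blocks is a list of (transcript_start, genome_start, size) tuples, in the
    spirit of the qStarts, tStarts and blockSizes columns of a PSL line that
    aligns the transcript (query) to the genome (target).
    """
    for transcript_start, genome_start, size in blocks:
        if transcript_start <= position < transcript_start + size:
            return genome_start + (position - transcript_start)
    raise ValueError(f"position {position} is not covered by any block")

# Toy transcript with two exons of 50 and 30 bases (illustrative coordinates).
blocks = [(0, 1000, 50), (50, 2000, 30)]
first_exon = transcript_to_genome(10, blocks)
second_exon = transcript_to_genome(60, blocks)
```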

Store the mapping results:
```
gzip *.rRNA.psl
mv *.rRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.rRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm rRNA.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 3.3 Filter against transfer RNA

Create the scripts to filter against transfer RNA, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py tRNA 100
bash -x script.sh
```
This will run
```
python filter.py tRNA READ1 <start_index> <end_index>
python filter.py tRNA READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the transfer RNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py tRNA
```
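
Before the per-job results can be combined, the chunk files have to be collected in index order. The sketch below assumes that the chunk files are named `<target>.<READ>.<start>-<end>.psl`, which is inferred from the cleanup pattern `tRNA.READ?.*-*.psl` rather than stated explicitly; the function name is hypothetical, and the step in which `merge_filtered.py` distributes the results over the libraries is not covered.

```python
import re

# Assumed naming scheme of the per-job output files.
CHUNK_PATTERN = re.compile(
    r"^(?P<target>[^.]+)\.(?P<read>READ[12])\.(?P<start>\d+)-(?P<end>\d+)\.psl$"
)

def ordered_chunks(filenames, target, read):
    """Return the chunk files for one target and read, sorted by numeric start index."""
    chunks = []
    for filename in filenames:
        match = CHUNK_PATTERN.match(filename)
        if match and match["target"] == target and match["read"] == read:
            chunks.append((int(match["start"]), filename))
    return [filename for start, filename in sorted(chunks)]

names = ["tRNA.READ1.1000-2000.psl", "tRNA.READ1.200-1000.psl",
         "tRNA.READ1.0-200.psl", "tRNA.READ2.0-200.psl", "script.sh"]
read1_chunks = ordered_chunks(names, "tRNA", "READ1")
```

Sorting on the integer start index matters because a plain lexical sort would place `1000-2000` before `200-1000`.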

Convert the transcript coordinates to genomic coordinates:
```
for file in *.tRNA.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```

Store the mapping results:
```
gzip *.tRNA.psl
mv *.tRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.tRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm tRNA.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

# 4. Align against known transcripts

## 4.1 Small nuclear RNAs (spliceosomal RNAs)

Create the scripts to filter against small nuclear RNAs, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py snRNA 100
bash -x script.sh
```
This will run
```
python filter.py snRNA READ1 <start_index> <end_index>
python filter.py snRNA READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the small nuclear RNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py snRNA
```

Convert the transcript coordinates to genomic coordinates:
```
for file in *.snRNA.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```

Store the mapping results:
```
gzip *.snRNA.psl
mv *.snRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.snRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm snRNA.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 4.2 Small cytoplasmic RNAs (7SL RNAs and brain cytoplasmic RNA 1)

Create the scripts to filter against small cytoplasmic RNAs, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py scRNA 100
bash -x script.sh
```
This will run
```
python filter.py scRNA READ1 <start_index> <end_index>
python filter.py scRNA READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the small cytoplasmic RNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py scRNA
```

Convert the transcript coordinates to genomic coordinates:
```
for file in *.scRNA.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```

Store the mapping results:
```
gzip *.scRNA.psl
mv *.scRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.scRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm scRNA.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 4.3 Small nucleolar RNAs

Create the scripts to filter against small nucleolar RNAs, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py snoRNA 100
bash -x script.sh
```
This will run
```
python filter.py snoRNA READ1 <start_index> <end_index>
python filter.py snoRNA READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the small nucleolar RNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py snoRNA
```

Convert the transcript coordinates to genomic coordinates:
```
for file in *.snoRNA.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```

Store the mapping results:
```
gzip *.snoRNA.psl
mv *.snoRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.snoRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm snoRNA.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 4.4 Ro-associated RNAs Y1/Y3/Y4/Y5

Create the scripts to filter against Ro-associated RNA sequences, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py yRNA 100
bash -x script.sh
```
This will run
```
python filter.py yRNA READ1 <start_index> <end_index>
python filter.py yRNA READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the Ro-associated RNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py yRNA
```

Convert the transcript coordinates to genomic coordinates:
```
for file in *.yRNA.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```
Store the mapping results:
```
gzip *.yRNA.psl
mv *.yRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.yRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm yRNA.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 4.5 Histone genes

Create the scripts to filter against histone mRNA sequences, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py histone 100
bash -x script.sh
```
This will run
```
python filter.py histone READ1 <start_index> <end_index>
python filter.py histone READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the histone mRNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py histone
```

Convert the transcript coordinates to genomic coordinates:
```
for file in *.histone.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```
Store the mapping results:
```
gzip *.histone.psl
mv *.histone.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.histone.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm histone.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 4.6 RNA component of mitochondrial RNA processing endoribonuclease (RMRP)

Create the scripts to filter against the RMRP transcript sequence, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py RMRP 100
bash -x script.sh
```
This will run
```
python filter.py RMRP READ1 <start_index> <end_index>
python filter.py RMRP READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the RMRP transcript sequence. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py RMRP
```

Convert the transcript coordinates to genomic coordinates:
```
for file in *.RMRP.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```
Store the mapping results:
```
gzip *.RMRP.psl
mv *.RMRP.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.RMRP.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm RMRP.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 4.7 Small Cajal body-specific RNAs

Create the scripts to filter against the small Cajal body-specific RNAs, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py scaRNA 100
bash -x script.sh
```
This will run
```
python filter.py scaRNA READ1 <start_index> <end_index>
python filter.py scaRNA READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the small Cajal body-specific RNAs. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py scaRNA
```

Convert the transcript coordinates to genomic coordinates:
```
for file in *.scaRNA.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```

Store the mapping results:
```
gzip *.scaRNA.psl
mv *.scaRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.scaRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm scaRNA.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 4.8 RNA component of the RNase P ribonucleoprotein

Create the scripts to filter against the RPPH transcript sequence, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py RPPH 100
bash -x script.sh
```
This will run
```
python filter.py RPPH READ1 <start_index> <end_index>
python filter.py RPPH READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the RPPH transcript sequence. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py RPPH
```

Convert the transcript coordinates to genomic coordinates:
```
for file in *.RPPH.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```

Store the mapping results:
```
gzip *.RPPH.psl
mv *.RPPH.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.RPPH.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm RPPH.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 4.9 Small ILF3/NF90-associated RNAs

Create the scripts to filter against the small ILF3/NF90-associated RNAs, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py snar 100
bash -x script.sh
```
This will run
```
python filter.py snar READ1 <start_index> <end_index>
python filter.py snar READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the small ILF3/NF90-associated RNAs. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py snar
```

Convert the transcript coordinates to genomic coordinates:
```
for file in *.snar.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```

Store the mapping results:
```
gzip *.snar.psl
mv *.snar.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.snar.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm snar.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 4.10 Telomerase RNA component (TERC)

Create the scripts to filter against the TERC transcript sequence, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py TERC 100
bash -x script.sh
```
This will run
```
python filter.py TERC READ1 <start_index> <end_index>
python filter.py TERC READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the TERC transcript sequence. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py TERC
```

Convert the transcript coordinates to genomic coordinates:
```
for file in *.TERC.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```

Store the mapping results:
```
gzip *.TERC.psl
mv *.TERC.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.TERC.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm TERC.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 4.11 Vault RNAs

Create the scripts to filter against vault RNAs, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py vRNA 100
bash -x script.sh
```
This will run
```
python filter.py vRNA READ1 <start_index> <end_index>
python filter.py vRNA READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the vault RNAs. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py vRNA
```

Convert the transcript coordinates to genomic coordinates:
```
for file in *.vRNA.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```
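
The `pslMap` step above projects alignments from transcript coordinates onto the genome, using the transcript-to-genome alignment in `Filters/$target.psl`. The core idea can be sketched as a block lookup. The sketch assumes a plus-strand transcript and 0-based coordinates; the block values and the function name are made up for the example, and `pslMap` itself additionally handles strands and alignments that span several blocks.

```python
def transcript_to_genome(pos, blocks):
    """Map a 0-based transcript position to a genome position.

    blocks is a list of (transcript_start, genome_start, size) tuples
    describing the exons of a plus-strand transcript.
    """
    for transcript_start, genome_start, size in blocks:
        if transcript_start <= pos < transcript_start + size:
            return genome_start + (pos - transcript_start)
    raise ValueError(f"position {pos} lies outside the transcript")


if __name__ == "__main__":
    # Hypothetical two-exon transcript: 100 bases at genome position 1000,
    # followed by 50 bases at genome position 5000.
    exons = [(0, 1000, 100), (100, 5000, 50)]
    print(transcript_to_genome(120, exons))
```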

Store the mapping results:
```
gzip *.vRNA.psl
mv *.vRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.vRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm vRNA.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 4.12 Metastasis-associated lung adenocarcinoma transcript 1

Create the scripts to filter against the MALAT1 transcript sequences, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py MALAT1 100
bash -x script.sh
```
This will run
```
python filter.py MALAT1 READ1 <start_index> <end_index>
python filter.py MALAT1 READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the MALAT1 transcript sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py MALAT1
```

Convert the transcript coordinates to genomic coordinates:
```
for file in *.MALAT1.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```
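
In the loop above, `samtools view -H -t` builds the SAM header from the `hg38.chrom.sizes` file, and `grep -v _alt` drops the alternate haplotype sequences. A rough Python equivalent of that header construction is sketched below; the function name is hypothetical and the sequence names and lengths in the usage example are illustrative only.

```python
def sam_header(chrom_sizes_lines, version="1.6"):
    """Build SAM header lines from 'name<TAB>length' records.

    Records whose name contains '_alt' are skipped, as with `grep -v _alt`.
    """
    lines = [f"@HD\tVN:{version}"]
    for line in chrom_sizes_lines:
        if not line.strip():
            continue  # ignore blank lines
        name, length = line.split()[:2]
        if "_alt" in name:
            continue
        lines.append(f"@SQ\tSN:{name}\tLN:{length}")
    return lines


if __name__ == "__main__":
    # Illustrative records, not read from the real chrom.sizes file.
    example = ["chrA\t1000", "chrA_XX0001v1_alt\t200", "chrB\t500"]
    print("\n".join(sam_header(example)))
```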

Store the mapping results:
```
gzip *.MALAT1.psl
mv *.MALAT1.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.MALAT1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm MALAT1.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 4.13 Small nucleolar RNA host genes

Create the scripts to filter against small nucleolar RNA host gene transcript sequences, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py snhg 100
bash -x script.sh
```
This will run
```
python filter.py snhg READ1 <start_index> <end_index>
python filter.py snhg READ2 <start_index> <end_index>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against the small nucleolar RNA host gene transcript sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py snhg
```
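
Each `filter.py` job above processes the reads between `<start_index>` and `<end_index>`. How `make_filter_scripts.py` chooses these ranges is not shown in this document. The sketch below is one possible scheme, assuming that the second argument (`100`) is the number of jobs and that ranges are half-open; both are assumptions, as is the function name.

```python
def split_ranges(total, n_chunks):
    """Split range(total) into at most n_chunks contiguous (start, end) ranges.

    Ranges are half-open, differ in size by at most one, and empty
    ranges are omitted when total < n_chunks.
    """
    base, extra = divmod(total, n_chunks)
    ranges = []
    start = 0
    for i in range(n_chunks):
        end = start + base + (1 if i < extra else 0)
        if end > start:
            ranges.append((start, end))
        start = end
    return ranges


if __name__ == "__main__":
    for start, end in split_ranges(10, 3):
        print(f"python filter.py snhg READ1 {start} {end}")
```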

Convert the transcript coordinates to genomic coordinates:
```
for file in *.snhg.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```

Store the mapping results:
```
gzip *.snhg.psl
mv *.snhg.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.snhg.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm snhg.READ?.*-*.psl
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

# 5. Align against known transcripts using BWA

## 5.1 Messenger RNAs

Create a `.2bit` file with the mRNA sequences:
```
faToTwoBit /osc-fs_home/mdehoon/Data/CASPARs/Filters/mRNA.fa mRNA.2bit
```
Use BWA to map all MiSeq sequences to the messenger RNAs:
```
python make_alignment_scripts.py mRNA 100
bash -x script.sh
```
This will run
```
bwa mem -O 0 -E 1 -A 1 -B 1 -T 10 -k 10 -c 100000000 -a -Y /osc-fs_home/mdehoon/Data/CASPARs/Filters/mRNA.fa seqlist_READ1_<start_index>_<end_index>.fa | samtools view -F 20 -u | bamToPsl - stdout | pslCheck stdin -pass=stdout -quiet 2> mRNA.READ1_<number>.out | sort -k 14 | pslRecalcMatch stdin mRNA.2bit seqlist_READ1_<start_index>_<end_index>.fa stdout | sort -k 10 > mRNA.READ1.<start_index>-<end_index>.psl
```
and
```
bwa mem -O 0 -E 1 -A 1 -B 1 -T 10 -k 10 -c 100000000 -a -Y /osc-fs_home/mdehoon/Data/CASPARs/Filters/mRNA.fa seqlist_READ2_<start_index>_<end_index>.fa | samtools view -F 20 -u | bamToPsl - stdout | pslCheck stdin -pass=stdout -quiet 2> mRNA.READ2_<number>.out | sort -k 14 | pslRecalcMatch stdin mRNA.2bit seqlist_READ2_<start_index>_<end_index>.fa stdout | sort -k 10 > mRNA.READ2.<start_index>-<end_index>.psl
```
which uses BWA to map each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against mRNA transcript sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py mRNA
```
generating merged files `<library>.mRNA.psl`.
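
The `bwa mem` options above set the match score (`-A 1`), mismatch penalty (`-B 1`), gap open penalty (`-O 0`), and gap extension penalty (`-E 1`), so that a gap of length k costs O + k*E; alignments scoring below `-T 10` are not reported. The sketch below shows how these parameters combine into a score for a given alignment. It only illustrates the arithmetic and is not how BWA computes alignments internally; the function name is hypothetical.

```python
def alignment_score(matches, mismatches, gap_lengths=(), A=1, B=1, O=0, E=1):
    """Score an alignment under bwa-mem-style affine gap penalties.

    gap_lengths lists the length of each insertion or deletion;
    a gap of length k costs O + k * E.
    """
    gap_cost = sum(O + k * E for k in gap_lengths)
    return A * matches - B * mismatches - gap_cost


def is_reported(score, threshold=10):
    """Return True if the score reaches the -T threshold."""
    return score >= threshold


if __name__ == "__main__":
    score = alignment_score(matches=20, mismatches=2, gap_lengths=[3])
    print(score, is_reported(score))
```

With these settings a mismatch and a one-base gap cost the same, so the score is the number of matched bases minus the number of mismatched or gapped bases.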

Convert the transcript coordinates to genomic coordinates:
```
for file in *.mRNA.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```
Store the mapping result:
```
gzip *.mRNA.psl
mv *.mRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.mRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm mRNA.2bit
rm seqlist_READ?_*_*.fa
rm mRNA.READ?.*-*.psl
rm mRNA.READ?_*.out
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 5.2 Long non-coding RNAs

Create a `.2bit` file with the lncRNA sequences:
```
faToTwoBit /osc-fs_home/mdehoon/Data/CASPARs/Filters/lncRNA.fa lncRNA.2bit
```
Use BWA to map all MiSeq sequences to long non-coding RNAs:
```
python make_alignment_scripts.py lncRNA 100
bash -x script.sh
```
This will run
```
bwa mem -O 0 -E 1 -A 1 -B 1 -T 10 -k 10 -c 100000000 -a -Y /osc-fs_home/mdehoon/Data/CASPARs/Filters/lncRNA.fa seqlist_READ1_<start_index>_<end_index>.fa | samtools view -F 20 -u | bamToPsl - stdout | pslCheck stdin -pass=stdout -quiet 2> lncRNA.READ1_<number>.out | sort -k 14 | pslRecalcMatch stdin lncRNA.2bit seqlist_READ1_<start_index>_<end_index>.fa stdout | sort -k 10 > lncRNA.READ1.<start_index>-<end_index>.psl
```
and
```
bwa mem -O 0 -E 1 -A 1 -B 1 -T 10 -k 10 -c 100000000 -a -Y /osc-fs_home/mdehoon/Data/CASPARs/Filters/lncRNA.fa seqlist_READ2_<start_index>_<end_index>.fa | samtools view -F 20 -u | bamToPsl - stdout | pslCheck stdin -pass=stdout -quiet 2> lncRNA.READ2_<number>.out | sort -k 14 | pslRecalcMatch stdin lncRNA.2bit seqlist_READ2_<start_index>_<end_index>.fa stdout | sort -k 10 > lncRNA.READ2.<start_index>-<end_index>.psl
```
which uses BWA to map each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against lncRNA transcript sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py lncRNA
```
generating merged files `<library>.lncRNA.psl`.
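
The `-F 20` option passed to `samtools view` above excludes every alignment whose FLAG has bit 4 (read unmapped) or bit 16 (read on the reverse strand) set, since 20 = 4 + 16. The genome alignment in section 6 uses `-F 4` and therefore keeps reverse-strand alignments. The bit test can be sketched as follows; the constant and function names are chosen for the example.

```python
UNMAPPED = 0x4   # SAM FLAG bit: read unmapped
REVERSE = 0x10   # SAM FLAG bit: read mapped to the reverse strand


def passes_exclude_filter(flag, exclude):
    """Return True if none of the bits in `exclude` are set in `flag`.

    This is the test applied by `samtools view -F <exclude>`.
    """
    return (flag & exclude) == 0


if __name__ == "__main__":
    for flag in (0, 4, 16, 256):
        print(flag,
              passes_exclude_filter(flag, UNMAPPED | REVERSE),
              passes_exclude_filter(flag, UNMAPPED))
```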

Convert the transcript coordinates to genomic coordinates:
```
for file in *.lncRNA.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```
Store the mapping result:
```
gzip *.lncRNA.psl
mv *.lncRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.lncRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm lncRNA.2bit
rm seqlist_READ?_*_*.fa
rm lncRNA.READ?.*-*.psl
rm lncRNA.READ?_*.out
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 5.3 Gencode transcripts

Create a `.2bit` file with the gencode transcript sequences:
```
faToTwoBit /osc-fs_home/mdehoon/Data/CASPARs/Filters/gencode.fa gencode.2bit
```
Use BWA to map all MiSeq sequences to Gencode transcripts:
```
python make_alignment_scripts.py gencode 100
bash -x script.sh
```
This will run
```
bwa mem -O 0 -E 1 -A 1 -B 1 -T 10 -k 10 -c 100000000 -a -Y /osc-fs_home/mdehoon/Data/CASPARs/Filters/gencode.fa seqlist_READ1_<start_index>_<end_index>.fa | samtools view -F 20 -u | bamToPsl - stdout | pslCheck stdin -pass=stdout -quiet 2> gencode.READ1_<number>.out | sort -k 14 | pslRecalcMatch stdin gencode.2bit seqlist_READ1_<start_index>_<end_index>.fa stdout | sort -k 10 > gencode.READ1.<start_index>-<end_index>.psl
```
and
```
bwa mem -O 0 -E 1 -A 1 -B 1 -T 10 -k 10 -c 100000000 -a -Y /osc-fs_home/mdehoon/Data/CASPARs/Filters/gencode.fa seqlist_READ2_<start_index>_<end_index>.fa | samtools view -F 20 -u | bamToPsl - stdout | pslCheck stdin -pass=stdout -quiet 2> gencode.READ2_<number>.out | sort -k 14 | pslRecalcMatch stdin gencode.2bit seqlist_READ2_<start_index>_<end_index>.fa stdout | sort -k 10 > gencode.READ2.<start_index>-<end_index>.psl
```
which uses BWA to map each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against gencode transcript sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py gencode
```
generating merged files `<library>.gencode.psl`.
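
In the pipeline above, the sort key of `sort -k 14` starts at PSL column 14, the target (transcript) name, and that of `sort -k 10` starts at column 10, the query (read) name. The sketch below names the 21 standard PSL columns so that these positions can be checked; the parser is a minimal illustration with a hypothetical function name, and the example record is made up.

```python
PSL_FIELDS = [
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]


def parse_psl_line(line):
    """Return a dict mapping PSL column names to the (string) values."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(PSL_FIELDS):
        raise ValueError(f"expected {len(PSL_FIELDS)} columns, found {len(values)}")
    return dict(zip(PSL_FIELDS, values))


if __name__ == "__main__":
    # Made-up record: a 30-base read aligned without gaps to a transcript.
    record = "\t".join([
        "30", "0", "0", "0", "0", "0", "0", "0", "+",
        "read_1", "30", "0", "30",
        "transcript_1", "1200", "100", "130",
        "1", "30,", "0,", "100,",
    ])
    fields = parse_psl_line(record)
    print(fields["qName"], fields["tName"])
```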

Convert the transcript coordinates to genomic coordinates:
```
for file in *.gencode.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```
Store the mapping result:
```
gzip *.gencode.psl
mv *.gencode.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.gencode.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm gencode.2bit
rm seqlist_READ?_*_*.fa
rm gencode.READ?.*-*.psl
rm gencode.READ?_*.out
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

## 5.4 FANTOM-CAT

Create a `.2bit` file with the FANTOM-CAT transcript sequences:
```
faToTwoBit /osc-fs_home/mdehoon/Data/CASPARs/Filters/fantomcat.fa fantomcat.2bit
```
Use BWA to map all MiSeq sequences to FANTOM-CAT transcripts:
```
python make_alignment_scripts.py fantomcat 100
bash -x script.sh
```
This will run
```
bwa mem -O 0 -E 1 -A 1 -B 1 -T 10 -k 10 -c 100000000 -a -Y /osc-fs_home/mdehoon/Data/CASPARs/Filters/fantomcat.fa seqlist_READ1_<start_index>_<end_index>.fa | samtools view -F 20 -u | bamToPsl - stdout | pslCheck stdin -pass=stdout -quiet 2> fantomcat.READ1_<number>.out | sort -k 14 | pslRecalcMatch stdin fantomcat.2bit seqlist_READ1_<start_index>_<end_index>.fa stdout | sort -k 10 > fantomcat.READ1.<start_index>-<end_index>.psl
```
and
```
bwa mem -O 0 -E 1 -A 1 -B 1 -T 10 -k 10 -c 100000000 -a -Y /osc-fs_home/mdehoon/Data/CASPARs/Filters/fantomcat.fa seqlist_READ2_<start_index>_<end_index>.fa | samtools view -F 20 -u | bamToPsl - stdout | pslCheck stdin -pass=stdout -quiet 2> fantomcat.READ2_<number>.out | sort -k 14 | pslRecalcMatch stdin fantomcat.2bit seqlist_READ2_<start_index>_<end_index>.fa stdout | sort -k 10 > fantomcat.READ2.<start_index>-<end_index>.psl
```
which uses BWA to map each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` against FANTOM-CAT transcript sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py fantomcat
```
generating merged files `<library>.fantomcat.psl`.

Convert the transcript coordinates to genomic coordinates:
```
for file in *.fantomcat.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    pslMap -mapInfo=$library.$target.info $file /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
    rm $library.$target.info
done
```
Store the mapping result:
```
gzip *.fantomcat.psl
mv *.fantomcat.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.fantomcat.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```

Remove the intermediate files:
```
rm fantomcat.2bit
rm seqlist_READ?_*_*.fa
rm fantomcat.READ?.*-*.psl
rm fantomcat.READ?_*.out
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

# 6. Perform alignments to the genome using BWA

Create a list of chromosomes, excluding the haplotype sequences with names ending in `_alt`:
```
twoBitInfo /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.2bit stdout | cut -f 1 | grep -v _alt > seqList
```
Create a Fasta file for these chromosomes:
```
twoBitToFa -noMask -seqList=seqList /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.2bit hg38.fa
```
Create the corresponding twoBit file:
```
faToTwoBit hg38.fa hg38.2bit
```
Create an index of the genome for BWA:
```
bwa index hg38.fa
```
Use BWA to map all MiSeq sequences to the genome:
```
python make_alignment_scripts.py genome 100
bash -x script.sh
```
This will run
```
bwa mem -O 0 -E 1 -A 1 -B 1 -T 10 -k 10 -c 100000000 -a -Y hg38.fa seqlist_READ1_<start_index>_<end_index>.fa | samtools view -F 4 -u | bamToPsl - stdout | pslCheck stdin -pass=stdout -quiet 2> genome.READ1_<number>.out | sort -k 14 | pslRecalcMatch stdin hg38.2bit seqlist_READ1_<start_index>_<end_index>.fa stdout | sort -k 10 > genome.READ1.<start_index>-<end_index>.psl
```
and
```
bwa mem -O 0 -E 1 -A 1 -B 1 -T 10 -k 10 -c 100000000 -a -Y hg38.fa seqlist_READ2_<start_index>_<end_index>.fa | samtools view -F 4 -u | bamToPsl - stdout | pslCheck stdin -pass=stdout -quiet 2> genome.READ2_<number>.out | sort -k 14 | pslRecalcMatch stdin hg38.2bit seqlist_READ2_<start_index>_<end_index>.fa stdout | sort -k 10 > genome.READ2.<start_index>-<end_index>.psl
```

which uses BWA to map each unique sequenced read in `seqlist_READ1.fa` and `seqlist_READ2.fa` to the genome. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py genome
```
generating merged files `<library>.genome.psl`. Convert the `.psl` files to `.bam` files:

```
for file in *.genome.psl; do
    IFS=. read library target psl <<< $file
    if [ $psl != "psl" ]; then echo "ERROR"; break; fi
    echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $library.$target.sam
    psl2sam.pl $file >> $library.$target.sam
    cat $library.$target.sam | python create_matepairs.py | samtools fixmate - $library.$target.bam
    python add_targets.py $library $target
    rm $library.$target.sam
done
```

Store the mapping result:
```
gzip *.genome.psl
mv *.genome.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/PSL
mv *.genome.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/BAM
```
Remove the intermediate files:
```
rm hg38.fa
rm hg38.fa.amb
rm hg38.fa.ann
rm hg38.fa.bwt
rm hg38.fa.pac
rm hg38.fa.sa
rm hg38.2bit
rm seqList
rm genome.READ?.*-*.psl
rm genome.READ?_*.out
rm script*.sh
rm script_*.stderr
rm script_*.stdout
```

# 7. Generate the bam file by merging the BWA results

Merge the mapping results for each library:
```
python make_merge_scripts.py 
bash -x script.sh
```

This runs the `mergebam.py` script on each library, generating a `<library>.bam` file for each library:
```
python mergebam.py <library>
```

Remove the intermediate files:
```
rm script*.sh
rm script_*.stderr
rm script_*.stdout
rm seqlist_READ?_*_*.fa
rm seqlist_READ?.fa
rm *.index.txt
```
Move the `.bam` files to `/osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/`:
```
mv c91_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv c91_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv c91_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv cel_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv gfi_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv gfi_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv gfi_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv lip_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv myb_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv myb_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv myb_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv n91_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv n91_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv n91_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv neg_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv nkd_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv nkd_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv nkd_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t00_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t00_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t00_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t01_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t01_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t01_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t04_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t04_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t04_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t12_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t12_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t12_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t24_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t24_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t24_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t96_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t96_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t96_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
```
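
The 36 library names used above follow a regular pattern: eleven conditions with three replicates each, plus three conditions (`cel`, `lip`, `neg`) with a single replicate. As an alternative to listing every file, the names can be generated and the files moved in a loop. This is a sketch only; the function names and the grouping into two lists are choices made for this example.

```python
import shutil
from pathlib import Path

TRIPLICATE = ["c91", "gfi", "myb", "n91", "nkd",
              "t00", "t01", "t04", "t12", "t24", "t96"]
SINGLE = ["cel", "lip", "neg"]


def library_names():
    """Return the sorted list of library names such as 'c91_r1'."""
    names = [f"{prefix}_r{replicate}"
             for prefix in TRIPLICATE for replicate in (1, 2, 3)]
    names += [f"{prefix}_r1" for prefix in SINGLE]
    return sorted(names)


def move_bam_files(destination, source="."):
    """Move <library>.bam for every library from source to destination."""
    for name in library_names():
        filename = f"{name}.bam"
        shutil.move(str(Path(source) / filename),
                    str(Path(destination) / filename))


if __name__ == "__main__":
    # Only list the names here; call move_bam_files() explicitly to move files.
    print(len(library_names()), "libraries")
```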

# 8. Add the sequence data to the BAM file

Add the sequence data to the BAM files:
```
python add_sequences.py c91_r1
python add_sequences.py c91_r2
python add_sequences.py c91_r3
python add_sequences.py cel_r1
python add_sequences.py gfi_r1
python add_sequences.py gfi_r2
python add_sequences.py gfi_r3
python add_sequences.py lip_r1
python add_sequences.py myb_r1
python add_sequences.py myb_r2
python add_sequences.py myb_r3
python add_sequences.py n91_r1
python add_sequences.py n91_r2
python add_sequences.py n91_r3
python add_sequences.py neg_r1
python add_sequences.py nkd_r1
python add_sequences.py nkd_r2
python add_sequences.py nkd_r3
python add_sequences.py t00_r1
python add_sequences.py t00_r2
python add_sequences.py t00_r3
python add_sequences.py t01_r1
python add_sequences.py t01_r2
python add_sequences.py t01_r3
python add_sequences.py t04_r1
python add_sequences.py t04_r2
python add_sequences.py t04_r3
python add_sequences.py t12_r1
python add_sequences.py t12_r2
python add_sequences.py t12_r3
python add_sequences.py t24_r1
python add_sequences.py t24_r2
python add_sequences.py t24_r3
python add_sequences.py t96_r1
python add_sequences.py t96_r2
python add_sequences.py t96_r3
```
This generates new BAM files.
Check that the BAM files are consistent:
```
samtools view -h c91_r1.bam > /dev/null
samtools view -h c91_r2.bam > /dev/null
samtools view -h c91_r3.bam > /dev/null
samtools view -h cel_r1.bam > /dev/null
samtools view -h gfi_r1.bam > /dev/null
samtools view -h gfi_r2.bam > /dev/null
samtools view -h gfi_r3.bam > /dev/null
samtools view -h lip_r1.bam > /dev/null
samtools view -h myb_r1.bam > /dev/null
samtools view -h myb_r2.bam > /dev/null
samtools view -h myb_r3.bam > /dev/null
samtools view -h n91_r1.bam > /dev/null
samtools view -h n91_r2.bam > /dev/null
samtools view -h n91_r3.bam > /dev/null
samtools view -h neg_r1.bam > /dev/null
samtools view -h nkd_r1.bam > /dev/null
samtools view -h nkd_r2.bam > /dev/null
samtools view -h nkd_r3.bam > /dev/null
samtools view -h t00_r1.bam > /dev/null
samtools view -h t00_r2.bam > /dev/null
samtools view -h t00_r3.bam > /dev/null
samtools view -h t01_r1.bam > /dev/null
samtools view -h t01_r2.bam > /dev/null
samtools view -h t01_r3.bam > /dev/null
samtools view -h t04_r1.bam > /dev/null
samtools view -h t04_r2.bam > /dev/null
samtools view -h t04_r3.bam > /dev/null
samtools view -h t12_r1.bam > /dev/null
samtools view -h t12_r2.bam > /dev/null
samtools view -h t12_r3.bam > /dev/null
samtools view -h t24_r1.bam > /dev/null
samtools view -h t24_r2.bam > /dev/null
samtools view -h t24_r3.bam > /dev/null
samtools view -h t96_r1.bam > /dev/null
samtools view -h t96_r2.bam > /dev/null
samtools view -h t96_r3.bam > /dev/null
```
This should not show any error messages.
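
The check above can also be scripted so that a failure is detected automatically instead of by reading the terminal output. The sketch below runs the same `samtools view -h` command and treats a non-zero exit status or any output on standard error as a failure. The function name and the injectable `run` argument (used so that the logic can be exercised without `samtools`) are choices made for this example.

```python
import subprocess


def bam_is_readable(path, run=subprocess.run):
    """Return True if `samtools view -h <path>` succeeds silently.

    Standard output is discarded, as with `> /dev/null`; the file is
    considered consistent only if the exit status is zero and nothing
    was written to standard error.
    """
    result = run(["samtools", "view", "-h", path],
                 stdout=subprocess.DEVNULL,
                 stderr=subprocess.PIPE)
    return result.returncode == 0 and not result.stderr
```

For example, `bam_is_readable("c91_r1.bam")` would check the first library, assuming `samtools` is available on the `PATH`.
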
Store the BAM files generated by `add_sequences.py`:
```
mv c91_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv c91_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv c91_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv cel_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv gfi_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv gfi_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv gfi_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv lip_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv myb_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv myb_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv myb_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv n91_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv n91_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv n91_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv neg_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv nkd_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv nkd_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv nkd_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t00_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t00_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t00_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t01_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t01_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t01_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t04_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t04_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t04_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t12_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t12_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t12_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t24_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t24_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t24_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t96_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t96_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t96_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
```
